In this section, we’re going to use California state data to build various models and compare their performance.
The dataset has already been properly partitioned, so now let’s build some models.
library(tidyverse)
library(kableExtra)
library(scales)
library(caret)
library(modelr)
library(ROSE)
library(glmnet)
library(rpart)
library(rpart.plot)
library(randomForest)
library(plotly)
One more step before we start building models. By grouping the four severity levels into two, the dataset is now more balanced. However, the plot below shows that the number of records in each severity level is still unequal.
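A sketch of this check (the name `train_set` for the partitioned training data is an assumption based on the sampling call below, as is the two-level factor `Status`):

```r
# Bar chart of record counts per severity level; `train_set` and the
# `Status` column are assumed names, not shown explicitly in the text
train_set %>%
  count(Status) %>%
  ggplot(aes(x = Status, y = n, fill = Status)) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(labels = comma) +
  labs(x = "Severity level", y = "Number of records")
```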
This is not a big issue in itself, but balanced data makes the model easier to train and the final accuracy easier to interpret (both sensitivity and specificity need to be high to achieve a high overall accuracy). So, let's apply some sampling techniques to balance the data.
Here, oversampling and undersampling are combined to balance the classes. Sampling also reduces the data to a size that is easier to manipulate.
new_train <- ovun.sample(Status ~ .,
                         data = train_set,
                         method = "both", p = 0.5,
                         N = 90000, seed = 1)$data %>%
  as_tibble()
Since our response variable now has two levels, "Severe" and "Not Severe", logistic regression is a natural baseline model.
To find a good formula for the logistic regression, we can apply stepwise model selection. For classification we cannot use the root-mean-square error as the criterion; instead we use an information criterion such as AIC or BIC. BIC penalizes model size more heavily, so a formula selected by BIC will generally contain fewer predictors than one selected by AIC. Neither is universally better, so we simply use AIC here.
# use step() to apply stepwise model selection
model_aic <- glm(Status ~ ., data = new_train, family = "binomial")
model_aic <- step(model_aic)
The following variables are dropped, in order, during the stepwise search:
| Variable dropped | AIC after dropping |
|---|---|
| Day | 101128.8 |
| Temperature | 101126.8 |
| Sunrise_Sunset | 101124.8 |
| Nautical_Twilight | 101122.9 |
| Astronomical_Twilight | 101121.1 |
| Civil_Twilight | 101119.9 |
| Humidity | 101118.7 |
The final formula selected by AIC:
## glm(formula = Status ~ TMC + Year + Month + Hour + Wday + Duration +
## Start_Lat + Start_Lng + Distance + Side + Pressure + Wind_Speed +
## Weather_Condition + Junction + Traffic_Signal, family = "binomial",
## data = new_train)
Next, we make predictions on the validation dataset. We choose 0.6 as the cutoff for transforming predicted probabilities into class labels, which gives a higher total accuracy.
We can inspect the performance of the logistic regression with a confusion matrix.
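As a sketch of this step (the validation-set name `valid_set` is an assumption, as is the factor-level order, under which `glm()` models the probability of the second level, "Severe"):

```r
# Predicted probabilities of "Severe" on the validation data
prob_aic <- predict(model_aic, newdata = valid_set, type = "response")
# Apply the 0.6 cutoff, then compare to the truth with caret
pred_aic <- factor(ifelse(prob_aic > 0.6, "Severe", "Not Severe"),
                   levels = levels(valid_set$Status))
confusionMatrix(pred_aic, valid_set$Status)
```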
| Accuracy | Sensitivity | Specificity | Positive class |
|---|---|---|---|
| 0.7154623 | 0.7352326 | 0.6754223 | Not Severe |
## Confusion Matrix and Statistics
##
##
## Not Severe Severe
## Not Severe 64002 13951
## Severe 23048 29031
##
## Accuracy : 0.7155
## 95% CI : (0.713, 0.7179)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3898
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7352
## Specificity : 0.6754
## Pos Pred Value : 0.8210
## Neg Pred Value : 0.5574
## Prevalence : 0.6695
## Detection Rate : 0.4922
## Detection Prevalence : 0.5995
## Balanced Accuracy : 0.7053
##
## 'Positive' Class : Not Severe
##
From the result above, the performance of plain logistic regression is not satisfactory. Let's try a different model.
Since our dataset has many variables, some of their coefficients may be near zero in the best model. So, we decide to try a sparse logistic model next.
Sparse logistic regression uses a lasso penalty: as the tuning parameter \(\lambda\) increases, more coefficients are forced to exactly zero. The plot below shows how the coefficients change along the regularization path.
x <- model.matrix(Status ~ ., data = new_train)
model_total <- glmnet(x, new_train$Status, family = "binomial")
plot(model_total)
To get the best sparse logistic model, we need to find the best tuning parameter \(\lambda\). Cross validation is used to find the best \(\lambda\) here.
model_lambda <- cv.glmnet(x, new_train$Status, family = "binomial")
plot(model_lambda)
With the best tuning parameter \(\lambda\), we build the model and make predictions on the validation dataset, again using 0.6 as the cutoff to gain a higher total accuracy.
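A sketch of this step (again assuming the validation data is called `valid_set`; `lambda.min` is the CV-chosen value stored in the `cv.glmnet` object):

```r
# Design matrix for the validation data, built the same way as for training
x_valid <- model.matrix(Status ~ ., data = valid_set)
# Probabilities of the second factor level at the CV-selected lambda
prob_lasso <- as.vector(predict(model_lambda, newx = x_valid,
                                s = "lambda.min", type = "response"))
pred_lasso <- factor(ifelse(prob_lasso > 0.6, "Severe", "Not Severe"),
                     levels = levels(valid_set$Status))
confusionMatrix(pred_lasso, valid_set$Status)
```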
| Accuracy | Sensitivity | Specificity | Positive class |
|---|---|---|---|
| 0.7138375 | 0.7277552 | 0.6856505 | Not Severe |
## Confusion Matrix and Statistics
##
##
## Not Severe Severe
## Not Severe 63354 13512
## Severe 23700 29472
##
## Accuracy : 0.7138
## 95% CI : (0.7114, 0.7163)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.39
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7278
## Specificity : 0.6857
## Pos Pred Value : 0.8242
## Neg Pred Value : 0.5543
## Prevalence : 0.6695
## Detection Rate : 0.4872
## Detection Prevalence : 0.5911
## Balanced Accuracy : 0.7067
##
## 'Positive' Class : Not Severe
##
Compared to the plain logistic model, the performance is similar, so sparse logistic regression does not give better predictions in this case. We still need to explore other models.
Tree-based algorithms can do a good job in classification. They also have a built-in feature-selection process, so we don't need to be as picky about our variables.
Next, we try decision trees, a very useful algorithm with a readily comprehensible structure.
model_decision <- rpart(Status ~ ., data = new_train, method = "class", minsplit = 20, cp = 0.001)
Usually we would plot the decision tree to inspect its nodes. But to reach a higher accuracy we allow many splits (cp = 0.001), which makes the final tree too complicated to plot legibly.
rpart.plot(model_decision, box.palette = "RdBu", shadow.col = "grey")
After building the tree, let's make predictions on the validation dataset.
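A sketch of the prediction step (`valid_set` is an assumed name); `type = "class"` makes rpart return hard labels directly, so no probability cutoff is needed:

```r
pred_tree <- predict(model_decision, newdata = valid_set, type = "class")
confusionMatrix(pred_tree, valid_set$Status)
```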
| Accuracy | Sensitivity | Specificity | Positive class |
|---|---|---|---|
| 0.8525123 | 0.8523101 | 0.852922 | Not Severe |
## Confusion Matrix and Statistics
##
##
## Not Severe Severe
## Not Severe 74197 6322
## Severe 12857 36662
##
## Accuracy : 0.8525
## 95% CI : (0.8506, 0.8544)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6791
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8523
## Specificity : 0.8529
## Pos Pred Value : 0.9215
## Neg Pred Value : 0.7404
## Prevalence : 0.6695
## Detection Rate : 0.5706
## Detection Prevalence : 0.6192
## Balanced Accuracy : 0.8526
##
## 'Positive' Class : Not Severe
##
From the result above, the decision tree clearly performs better than the two logistic regression models. What's more, it takes much less time to train than the logistic models.
So far, the decision tree is the best model we have.
Decision trees have a well-known disadvantage (not very pronounced here): they can achieve high accuracy on the training set but much lower accuracy on the test set, which is the result of overfitting.
Random forests alleviate this overfitting by bootstrap sampling: each tree is grown on a bootstrap resample of the training data, and predictions are aggregated across trees. Analyzing the resulting out-of-bag (OOB) error rate gives a more realistic picture of model performance.
Let's see if a random forest can improve the accuracy even further.
model_rf <- randomForest(Status ~ ., data = new_train, mtry = 6, ntree = 500)
These two arguments here are very important:
| Name | Description |
|---|---|
| mtry | Number of variables randomly sampled as candidates at each split |
| ntree | Number of trees to grow |
To train a better random forest model, we need to find proper values for these two arguments.
Use the plot below to see whether ntree = 500 is enough:
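One way to produce such a plot: the `plot()` method for a `randomForest` object draws the OOB and per-class error rates against the number of trees grown.

```r
# Error rate (OOB plus one line per class) as trees are added
plot(model_rf, main = "Error rate vs. number of trees")
legend("topright", colnames(model_rf$err.rate), lty = 1:3, col = 1:3)
```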
As we can see, the error rate levels off as the number of trees increases. So, ntree = 500 is enough.
mtry:
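A hedged way to tune mtry (the grid and the smaller ntree below are illustrative choices, not from the text): fit one forest per candidate value and compare the final OOB error.

```r
mtry_grid <- seq(2, 20, by = 2)
oob_err <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Status ~ ., data = new_train, mtry = m, ntree = 100)
  fit$err.rate[nrow(fit$err.rate), "OOB"]  # OOB error after the last tree
})
plot(mtry_grid, oob_err, type = "b", xlab = "mtry", ylab = "OOB error rate")
```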
The OOB error reaches its minimum at mtry = 18, so that is the value we use.
Next, we use ntree = 500 and mtry = 18 to build the final random forest model and make predictions on the validation dataset.
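A sketch of the final fit and evaluation (the validation-set name `valid_set` is an assumption):

```r
model_rf_final <- randomForest(Status ~ ., data = new_train,
                               mtry = 18, ntree = 500)
pred_rf <- predict(model_rf_final, newdata = valid_set)
confusionMatrix(pred_rf, valid_set$Status)
```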
| Accuracy | Sensitivity | Specificity | Positive class |
|---|---|---|---|
| 0.8849106 | 0.870184 | 0.9147357 | Not Severe |
## Confusion Matrix and Statistics
##
##
## Not Severe Severe
## Not Severe 75753 3665
## Severe 11301 39319
##
## Accuracy : 0.8849
## 95% CI : (0.8832, 0.8866)
## No Information Rate : 0.6695
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7511
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8702
## Specificity : 0.9147
## Pos Pred Value : 0.9539
## Neg Pred Value : 0.7767
## Prevalence : 0.6695
## Detection Rate : 0.5825
## Detection Prevalence : 0.6107
## Balanced Accuracy : 0.8925
##
## 'Positive' Class : Not Severe
##
According to the results above, the random forest does improve accuracy compared to the single decision tree. However, training and tuning the random forest takes far longer than training a decent decision tree.
In conclusion, considering both performance and training time, I prefer the decision tree for making predictions. But if we care about nothing but accuracy, then the random forest is the winner.
The table below contains the performance of each model:
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 0.7154623 | 0.7352326 | 0.6754223 |
| Sparse Logistic Regression | 0.7138375 | 0.7277552 | 0.6856505 |
| Decision Tree | 0.8525123 | 0.8523101 | 0.8529220 |
| Random Forest | 0.8849106 | 0.8701840 | 0.9147357 |